Figure 1

In this first example, I wanted to investigate the relationship between HOLC zoning from the early-to-mid 20th century and the current racial makeup of those same zones, to assess the lasting legacy of these housing policies.

The first step is to load our packages and data from GitHub. We will do this for every example.

library(RColorBrewer)
library(tidyverse)
library(reshape2)
holc <- read.csv("https://raw.githubusercontent.com/SOCatCC/DAV/main/Data/holc.csv")
ls(holc)
##  [1] "Can_P"      "city"       "grade"      "holc_grade" "holc_id"   
##  [6] "pasiannh"   "pbachup"    "pblacknh"   "pcan"       "platinx"   
## [11] "plesshs"    "pmin"       "pown"       "ppov"       "pwhitenh"  
## [16] "Sum_asiann" "Sum_bachup" "Sum_blackn" "Sum_hsdeg"  "Sum_latinx"
## [21] "Sum_lesshs" "Sum_occ"    "Sum_ownocc" "Sum_pop25"  "Sum_poptot"
## [26] "Sum_povbel" "Sum_povsta" "Sum_rentoc" "Sum_whiten"
# Median share of each racial group by city and HOLC grade
holc1 <- holc %>%
  group_by(city, grade) %>%
  summarise(medasian = median(pasiannh, na.rm = TRUE),
            medblack = median(pblacknh, na.rm = TRUE),
            medlatin = median(platinx, na.rm = TRUE),
            medwhite = median(pwhitenh, na.rm = TRUE),
            .groups = "drop")

# Total population of each racial group by HOLC grade
holc2 <- holc %>%
  group_by(grade) %>%
  summarise(sumasian = sum(Sum_asiann, na.rm = TRUE),
            sumblack = sum(Sum_blackn, na.rm = TRUE),
            sumlatin = sum(Sum_latinx, na.rm = TRUE),
            sumwhite = sum(Sum_whiten, na.rm = TRUE))

# Reshape to long format: one row per grade and race, with the total in `count`
holc3 <- melt(holc2, id.vars = "grade", variable.name = "race", value.name = "count")

red <- read.csv("https://raw.githubusercontent.com/aniljerg/Anil/main/redliningduplication.csv")

The next step is to manipulate our data into a form that ggplot can turn into a bar plot. Starting from the total count of people of each race in each grade, the data is melted into long format so that the counts line up vertically, as seen in the second data.frame window. With its default settings, geom_bar counts rows itself, so a bar plot of proportions built this way needs one row per observation with two columns: here, grade and race. To get the data into this format, rather than keeping a third count column, I used an Excel extension that duplicates rows. The extension was very inefficient, so after letting it run for about an hour trying to create 153,863 new rows of [A, sumasian] and so on, I realized it would be best to cut down the number of rows. I rounded each count to the nearest thousand (i.e. 153,863 becomes 154) and let it run for about an hour and a half until it finished. This new data set is our “red” object.
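As an aside, the same proportions can likely be drawn without duplicating any rows. The sketch below is not the method used for this figure. It uses a small made-up table shaped like holc3 (the `toy` data frame and its counts are invented for illustration, not taken from the HOLC data) and lets ggplot2 work from the count column directly, either through geom_col or through the weight aesthetic of geom_bar.

```r
library(ggplot2)

# Hypothetical stand-in for holc3: one row per grade and race with a count.
# These numbers are invented for illustration only.
toy <- data.frame(
  grade = rep(c("A", "B"), each = 2),
  race  = rep(c("sumblack", "sumwhite"), times = 2),
  count = c(10, 90, 40, 60)
)

# position = "fill" rescales each bar to sum to 1, so the counts
# become within-grade proportions without duplicating any rows.
p_fill <- ggplot(toy) +
  geom_col(aes(x = grade, y = count, fill = race), position = "fill")

# Equivalent with geom_bar: the weight aesthetic makes each row
# contribute `count` to the bar instead of 1.
p_weight <- ggplot(toy) +
  geom_bar(aes(x = grade, fill = race, weight = count), position = "fill")

p_fill
```

Applied to the real data, the same call would take holc3 in place of toy, which would keep the exact counts rather than values rounded to the nearest thousand.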

ggplot(red) +
  # position = "fill" stacks the counts and rescales each bar to proportions
  geom_bar(aes(x = as.factor(grade), fill = race), position = "fill") +
  theme_minimal() +
  labs(subtitle = "Racial makeup of people living today in \n Home Owners' Loan Corporation (HOLC) map categories \n from the early-mid 20th Century",
       title = "How have racist zoning policies persisted?",
       caption = "Something About cities",
       y = NULL,  # NULL drops the axis label
       x = "HOLC Grade") +
  theme(legend.title = element_blank(),
        text = element_text(size = 12, family = "Georgia"),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  scale_fill_manual(values = c("#ed6a5a", "#f0b67f", "#9bc1bc", "#388697"),
                    labels = c("Asian", "Black", "Latinx", "White")) +
  scale_y_continuous(breaks = NULL)

This is our final chart. The code here is all stuff we’ve done before; the data prep was the hard bit. It is a basic geom_bar with position = "fill" to make it show proportions. Building off the minimal theme, I added titles, removed the y-axis label, and renamed the x-axis. No legend title was needed and neither were the grid lines, so those had to go. I spent way too long choosing this color palette, but I had my film major friend messing around with it with me, so it was a good time. The only thing left to do was change the labels on the legend. I chose to go with ‘Latinx’ as it is inclusive, well understood, and accepted by my audience for this chart.

In this chart, DEI factors were heavily considered in ordering, coloring, and creating the graph. A possible change would be to make the legend about people (i.e. Latinx -> Latinx people). This chart is made from a collection of numbers, but the most important thing these numbers can convey is an understanding of the people behind them. I chose not to make this change because it makes the legend much larger and more repetitive, but it would be prudent to discuss the people behind this figure in any writing on it, as we are doing here.

Figure 2

Next, we will do an investigation into the relationship between GDP per capita and life expectancy across countries and time periods.

library(gapminder)
# Express population in millions
gapm <- gapminder_unfiltered %>%
  mutate(pop = pop / 1000000)

ls(gapm)
## [1] "continent" "country"   "gdpPercap" "lifeExp"   "pop"       "year"
library(ggrepel)
library(viridis)
library(tidyverse)
library(patchwork)
library(scales)
library(plotly)
gdppercapscatter <- gapm %>%
  filter(year == 2007) %>%
  ggplot(aes(gdpPercap, lifeExp)) +
  # `text` is not a ggplot2 aesthetic (expect an "unknown aesthetics" warning);
  # ggplotly() reads it to show the country name in the hover tooltip
  geom_point(aes(color = continent, size = pop, text = country)) +
  geom_smooth(method = "loess", se = FALSE, color = "grey69") +
  labs(title = "Looking for Outliers in Life expectancy and GDP per Capita in 2007\nSized by Population \nFSU = Former Soviet Union",
       y = "Life Expectancy",
       x = "GDP per Capita") +
  scale_x_continuous(labels = dollar) +
  scale_size_continuous(labels = comma) +
  guides(color = guide_legend(title = "Continent"), size = guide_legend(title = "")) +
  theme_minimal() +
  theme(text = element_text(size = 10, family = "Georgia")) +
  annotate("text", x = 18000, y = 65, label = "Russia")
ggplotly(gdppercapscatter)

First we will make a scatter plot that shows the relationship between these two variables. Some might think it prudent to put the x-axis on a log scale, but in this case I disagree. On a linear scale, the graph more clearly shows the massive increase in life expectancy over the first part of the relationship and the diminishing returns as GDP per capita continues to rise. The most obvious outlier is Russia, due to its population and its position well below the line of best fit. I have turned this into a plotly chart to make it easier to investigate countries one might be curious about. I am interested in Russia in this case, but why not make it easy for people to explore their own curiosity? I have also annotated Russia to make it stand out. The country points are sized by population, but plotly will only take one legend at a time, so there is no legend specifying this. I am okay with this because this use of sizing is pretty intuitive and common. Most people have seen this graph before, and even if they haven’t, they have most likely seen something similar.
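To let readers judge the linear-versus-log question for themselves, here is a minimal sketch that draws the same 2007 points on both scales. The object names (gap2007, base, p_linear, p_log) are made up for this illustration, and the plots are stripped down compared with the figure above.

```r
library(ggplot2)
library(gapminder)

# The 2007 slice of the unfiltered gapminder data
gap2007 <- subset(gapminder_unfiltered, year == 2007)

# Shared base plot; only the x scale differs between the two versions
base <- ggplot(gap2007, aes(gdpPercap, lifeExp)) +
  geom_point(aes(color = continent)) +
  theme_minimal()

# Linear x-axis: the steep early rise and later flattening stay visible
p_linear <- base + scale_x_continuous(labels = scales::dollar)

# Log x-axis: low-income countries are spread out and the curve is flattened
p_log <- base + scale_x_log10(labels = scales::dollar)

p_linear
p_log
```

The log version makes it easier to tell low-income countries apart, while the linear version keeps the diminishing-returns shape that the paragraph above argues for.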

# Countries to compare: Russia plus a handful from other regions
gapm1 <- gapm %>%
  filter(country %in% c("Russia", "United States", "Angola", "Germany",
                        "India", "Japan", "Estonia", "Chile"))

# Comparison countries (everything except Russia)
gapm11 <- gapm1 %>%
  filter(country != "Russia")

# Russia on its own, so it can be drawn in a highlight color
gapm12 <- gapm1 %>%
  filter(country == "Russia")

lespaghetti <- ggplot() +
  # Comparison countries in a muted, semi-transparent color
  geom_line(gapm11, mapping = aes(year, lifeExp, group = country), colour = "cadetblue3", alpha = 0.4) +
  # Russia highlighted in red
  geom_line(gapm12, mapping = aes(year, lifeExp), colour = "red") +
  theme(panel.border = element_blank(),
        panel.background = element_blank()) +
  labs(y = "Life Expectancy",
       x = NULL) +
  # Country labels placed by hand at the right-hand end of each line
  annotate("text", x = 2010, y = 65, label = "Russia", colour = "red") +
  annotate("text", x = 2008, y = 76, label = "United States", colour = "cadetblue3") +
  annotate("text", x = 2010, y = 43, label = "Angola", colour = "cadetblue3") +
  annotate("text", x = 2009, y = 81, label = "Germany", colour = "cadetblue3") +
  annotate("text", x = 2010, y = 63, label = "India", colour = "cadetblue3") +
  annotate("text", x = 2010, y = 71, label = "Estonia", colour = "cadetblue3") +
  annotate("text", x = 2010, y = 83, label = "Japan", colour = "cadetblue3") +
  annotate("text", x = 2010, y = 79, label = "Chile", colour = "cadetblue3")
lespaghetti

We can start the investigation by comparing Russia’s life expectancy with that of other countries. The two former Soviet Union countries are noticeably different from the others, as they are the only two that do not see somewhat stable progress over time. Their divergence from each other after 1998 is also very interesting: Estonia begins to pull itself out of the slump, but Russia’s life expectancy stays roughly level and below pre-perestroika levels. I have chosen to highlight Russia while making the other countries more transparent so that our country of interest is most obvious. I removed the panel background and border, which also hides the grid lines, because they made it more difficult to see the trends of these multiple lines. I chose one country from each of the major continents, with two from Asia because of its massive diversity in life expectancy change. Other continents also have a diversity of experiences like this, but not to the same degree, and adding many more countries would cloud the purpose of this graph.

gapmrus <- gapm %>%
  filter(country == "Russia")

# GDP per capita over time, with a reference line at 1998
gdppcrus1 <- ggplot(gapmrus, mapping = aes(x = year)) +
  geom_line(mapping = aes(y = gdpPercap), color = "blue") +
  theme_minimal() +
  labs(x = NULL, title = "Russian GDP Per Capita") +
  scale_y_continuous(name = NULL, labels = dollar) +
  geom_vline(xintercept = 1998, lwd = 1, colour = "black") +
  annotate("text", x = 2000, y = 14000, label = "1998")

# Life expectancy over time, with the same reference line
gdppcrus2 <- ggplot(gapmrus, mapping = aes(x = year)) +
  geom_line(mapping = aes(y = lifeExp), color = "red") +
  theme_minimal() +
  labs(x = NULL, title = "Russian Life Expectancy") +
  scale_y_continuous(name = NULL) +
  geom_vline(xintercept = 1998, lwd = 1, colour = "black")

# patchwork: stack life expectancy above GDP per capita
gdppcrus2 / gdppcrus1

Finally, we can compare Russian GDP per capita and life expectancy to see when the divergence occurred. Russian GDP per capita begins to recover remarkably in the late 1990s, but the country’s life expectancy remains stubborn. Guessing at why this is true is well beyond the scope of this example, but it could be a result of the current kleptocracy siphoning away these GDP gains, rampant alcoholism, or many other reasons. I chose to stack these two graphs on top of each other to show their changes over time and to highlight the divergence between GDP per capita and life expectancy. The vertical line and annotation mark the year in which this divergence occurred.

Figure 3

Let’s investigate the median household income distributions of Kentucky, Tennessee, Virginia, and West Virginia.

mining <- read.csv("https://raw.githubusercontent.com/SOCatCC/DAV/main/Data/mining.csv")
library(ggplot2)
ls(mining)
##  [1] "County"    "fips"      "mhi"       "minestat"  "minestatc" "minestatl"
##  [7] "pfairpoor" "phydays"   "plbw"      "ppov"      "State"     "ypllrate"
library(scales)
ggplot(mining) +
  # Overlapping density curves of median household income, one per state
  geom_density(aes(mhi, fill = State), alpha = 0.4) +
  theme_minimal() +
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) +
  scale_x_continuous(labels = dollar) +
  labs(title = "Median Household Income Distribution \nin the Central-Eastern United States",
       x = NULL) +
  scale_fill_manual(values = c("yellow", "blue", "grey", "pink"))

Here we have a density plot presenting the income distributions of the four states. Overlaying them in this manner makes the differences between the states easier to see. I removed the y-axis labels because they are more distracting than helpful. The coloring in this case is a little suspect: I had a hard time finding colors that remain distinguishable even when they all overlap. I think this final iteration is acceptable, but I wish it were better. This may just be a limitation of this visualization.

Figure 4

Our fourth figure will focus on statistical applications of visualization in R.

mining <- read.csv("https://raw.githubusercontent.com/SOCatCC/DAV/main/Data/mining.csv")
library(Hmisc)
library(DataExplorer)
library(statsExpressions)
library(ggstatsplot)
library(ggside)
library(ggcorrplot)
library(PMCMRplus)
mining$minestatc <- factor(mining$minestatc)
#create_report(mining)

If one removes the # from the create_report line here, the DataExplorer package will generate a variety of statistical visualizations in a separate HTML page.

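# Keep the four numeric variables of interest
mine1 <- mining %>%
  select(pfairpoor, ppov, ypllrate, mhi)

# Creating a correlation matrix using ggstatsplot.
# Depending on the installed ggstatsplot version, Spearman may need to be
# requested with type = "nonparametric" instead of corr.method = "spearman".
ggstatsplot::ggcorrmat(data = mine1, corr.method = "spearman", sig.level = 0.005) +
  labs(title = "Correlation Matrix of Mining Data")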

Here we have created a correlation matrix using ggstatsplot. It is one of a number of ways one can visualize this information, but I like the color distinctions that it offers. Below are some more uses of ggstatsplot to easily create statistical representations of data.
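For readers who want to see the numbers behind a plot like this, here is a minimal base R sketch of a Spearman correlation matrix. The data are synthetic and generated only for illustration (the column names mimic mine1, but the values are not from the mining data), and ggstatsplot may additionally adjust p-values for multiple comparisons, which this sketch does not do.

```r
# Synthetic data for illustration only; column names mimic mine1
set.seed(1)
toy <- data.frame(mhi = runif(50, 20000, 80000))
toy$ppov <- 40 - toy$mhi / 2500 + rnorm(50, sd = 2)        # falls as income rises
toy$pfairpoor <- 10 + 0.5 * toy$ppov + rnorm(50, sd = 2)   # rises with poverty

# Spearman rank correlations: the kind of values shown in each cell of the matrix
rho <- cor(toy, method = "spearman")
round(rho, 2)

# p-value for a single pair, to compare against a significance level such as 0.005
p_mhi_ppov <- cor.test(toy$mhi, toy$ppov, method = "spearman")$p.value
p_mhi_ppov
```

Spearman works on ranks rather than raw values, so it captures monotonic relationships without assuming they are linear.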

Figure 5

This figure is another representation of the relationship between GDP per capita and life expectancy. This time it has been animated using gganimate. It is heavily based on the code Wade sent to me, but I have altered it to include labels on the outlying countries and to increase the pixel count and frame rate of the animation.

library(gapminder)
library(gganimate)
library(ggrepel)   # for geom_text_repel()
library(viridis)
library(gifski)
library(RColorBrewer)
# African countries only, ordered by year
gapm7 <- gapminder_unfiltered %>%
  arrange(year) %>%
  filter(continent == "Africa")
gmplot <- ggplot(data = gapm7, aes(x = gdpPercap, y = lifeExp, size = pop, color = country, label = country)) +
  geom_point(alpha = 0.6, show.legend = FALSE) +
  geom_text_repel(max.overlaps = 20) +
  scale_color_manual(values = country_colors) +
  scale_x_continuous(labels = scales::dollar_format()) +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_size(range = c(2, 12)) +
  labs(title = "GDP per Capita's Impact on Life Expectancy in Africa",
       x = "GDP Per Capita",
       y = "Life Expectancy")
gmplot

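# Animate over years; the title shows the year of the current frame
gmplot.animation <- gmplot +
  transition_time(year) +
  labs(title = "Year: {frame_time}") +
  shadow_wake(wake_length = 0.01, alpha = FALSE)

# 200 frames at 15 fps, rendered at 800 x 800 pixels
animate(gmplot.animation, fps = 15, detail = 2, nframes = 200, height = 800, width = 800)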

In this final example, I have mapped child poverty by county. I chose to do it this way to show the concentration of child poverty in the southern United States, with particular hotspots cropping up in rural parts of the West.